– We successfully installed RStudio and familiarized ourselves with its interface.
– We successfully learned how to visualize data as desired (including data cleaning and manipulation).
For Part 3, we’ll dive into why the Hawks lost.
Last session we created the figure showing how the score changes over time. That figure is an observation of the result; when you show it, RM will ask you, “Why did that happen?” So let’s answer his question.
We’ve seen that the Hawks lost the game. As a curious researcher, you might wonder why they lost or which quarter was their downfall against the Knicks. Let’s analyze their shot selection (2-pointers like dunks, jump shots, and layups, or 3-pointers) and success rates throughout each quarter. This will help us identify potential weaknesses.
# Load the tidyverse package (game)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the ggplot2 package (game)
library(ggplot2)
# Load the ggpmisc package (game)
library(ggpmisc)
## Warning: package 'ggpmisc' was built under R version 4.3.3
## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
## method from
## heightDetails.titleGrob ggplot2
## widthDetails.titleGrob ggplot2
##
## Attaching package: 'ggpp'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
##
## Registered S3 method overwritten by 'ggpmisc':
## method from
## as.character.polynomial polynom
# Load the gganimate package (game)
library(gganimate)
# Load the animation package (game)
library(animation)
# Load the kableExtra package (game)
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
## bring the dataframe as "NBA_19_20_SAF_March_11" with readRDS() function
NBA_19_20_SAF_March_11 <- readRDS("~/Downloads/NBA_19_20_SAF_March_11.rds")
Let’s determine the shot frequency for each shot type (excluding free throws) based on the ShotType and ShotOutcome columns.
## Count shots by shot type (2-pt dunk, jump shot, layup, or 3-pt) and outcome
shot_type_with_outcome_count <- NBA_19_20_SAF_March_11 %>%
  group_by(ShotType, ShotOutcome) %>%
  summarize(count = n()) ## number of events for each (ShotType, ShotOutcome) pair
## `summarise()` has grouped output by 'ShotType'. You can override using the
## `.groups` argument.
## Count all attempts by shot type (2-pt dunk, jump shot, layup, or 3-pt)
shot_type_count <- NBA_19_20_SAF_March_11 %>%
  group_by(ShotType) %>%
  summarize(total_count = n()) ## number of events for each ShotType
## Attach the totals to the outcome counts and convert to a percentage
compute_the_shot_rate <- left_join(shot_type_with_outcome_count, shot_type_count, by = "ShotType")
compute_the_shot_rate$rate <- compute_the_shot_rate$count / compute_the_shot_rate$total_count * 100
compute_the_shot_rate
## # A tibble: 9 × 5
## # Groups: ShotType [5]
## ShotType ShotOutcome count total_count rate
## <chr> <chr> <int> <int> <dbl>
## 1 "" "" 332 332 100
## 2 "2-pt dunk" "make" 20 23 87.0
## 3 "2-pt dunk" "miss" 3 23 13.0
## 4 "2-pt jump shot" "make" 31 68 45.6
## 5 "2-pt jump shot" "miss" 37 68 54.4
## 6 "2-pt layup" "make" 22 40 55
## 7 "2-pt layup" "miss" 18 40 45
## 8 "3-pt jump shot" "make" 25 69 36.2
## 9 "3-pt jump shot" "miss" 44 69 63.8
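Note that the first row of the output above has an empty ShotType: these are the 332 events with no shot type recorded, so their 100% is not a shot rate. A minimal sketch of one way to drop those rows and compute the rates in a single pipeline, shown on a small made-up data frame (toy_plays and toy_shot_rate are hypothetical names; only the column names ShotType and ShotOutcome come from the data set):

```r
library(dplyr)

# Hypothetical toy data; only the column names mirror NBA_19_20_SAF_March_11
toy_plays <- tibble(
  ShotType    = c("2-pt dunk", "2-pt dunk", "3-pt jump shot", "3-pt jump shot", "", ""),
  ShotOutcome = c("make", "miss", "make", "make", "", "")
)

toy_shot_rate <- toy_plays %>%
  filter(ShotType != "") %>%                        # drop events with no shot recorded
  count(ShotType, ShotOutcome, name = "count") %>%  # shots per type and outcome
  group_by(ShotType) %>%
  mutate(
    total_count = sum(count),                       # attempts per shot type
    rate = count / total_count * 100                # share of each outcome, in percent
  ) %>%
  ungroup()

toy_shot_rate
```

Applied to the real data, the same pipeline should give the eight shot rows without the empty-string row, assuming empty strings are the only marker for non-shot events.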
While this code provides overall shot rates, we are interested in analyzing team-specific performance.
To start, let’s separate the data into Knicks and Hawks plays using the subset() function. The logic is that if the AwayPlay column is not empty, it’s a Knicks play.
play_by_nyk <- subset(NBA_19_20_SAF_March_11, NBA_19_20_SAF_March_11$AwayPlay != "" )
play_by_atl <- subset(NBA_19_20_SAF_March_11, NBA_19_20_SAF_March_11$HomePlay != "" )
Verify that the data is correctly separated:
unique(play_by_nyk$HomePlay)
## [1] ""
unique(play_by_atl$AwayPlay)
## [1] ""
We’ll now calculate shot rates for both teams. However, manually repeating this process for multiple teams is inefficient. In the next step, we’ll create a function to automate the calculation.
Let’s create a function named compute_shot_rate_per_quarter to efficiently calculate shot rates for each quarter. This function takes a DataFrame (df) as input and returns the calculated shot rates.
# Compute the per-quarter success rate for each shot type
compute_shot_rate_per_quarter <- function(df){
  ## Count shots by quarter, shot type, and outcome (make or miss)
  shot_type_with_outcome_count <- df %>% ## df is whichever data frame is passed in
    group_by(Quarter, ShotType, ShotOutcome) %>%
    summarize(count = n()) ## number of events for each (Quarter, ShotType, ShotOutcome)
  ## Count all attempts by quarter and shot type
  shot_type_count <- df %>%
    group_by(Quarter, ShotType) %>%
    summarize(total_count = n()) ## number of events for each (Quarter, ShotType)
  ## Attach the attempt totals and convert to a percentage
  compute_the_shot_rate <- left_join(shot_type_with_outcome_count, shot_type_count, by = c("Quarter", "ShotType") )
  compute_the_shot_rate$rate <- compute_the_shot_rate$count / compute_the_shot_rate$total_count * 100
  ## Keep only the made shots, so that rate is the success rate
  compute_the_shot_rate <- compute_the_shot_rate %>%
    subset(ShotOutcome == "make")
  return(compute_the_shot_rate)
}
This function replaces the specific DataFrame NBA_19_20_SAF_March_11 with a generic df, making it adaptable to different datasets.
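One caveat: because the function keeps only the rows where ShotOutcome is "make", a shot type that was attempted but never made in a quarter gets no row at all, so it later appears as missing rather than as 0%. A minimal sketch of an alternative that keeps those cases, assuming the same Quarter, ShotType, and ShotOutcome columns (compute_shot_rate_keep_zero and toy_df are hypothetical names, not part of the original analysis):

```r
library(dplyr)

# Hypothetical variant: one row per (Quarter, ShotType), including 0% rates
compute_shot_rate_keep_zero <- function(df) {
  df %>%
    filter(ShotType != "") %>%                 # ignore events with no shot recorded
    group_by(Quarter, ShotType) %>%
    summarize(
      count = sum(ShotOutcome == "make"),      # made shots
      total_count = n(),                       # all attempts
      rate = count / total_count * 100,        # success rate in percent
      .groups = "drop"                         # return an ungrouped tibble, no message
    )
}

# Made-up example: the Q1 dunk was attempted once and missed
toy_df <- tibble(
  Quarter     = c(1L, 1L, 1L, 2L),
  ShotType    = c("2-pt layup", "2-pt layup", "2-pt dunk", "2-pt dunk"),
  ShotOutcome = c("make", "miss", "miss", "make")
)

toy_result <- compute_shot_rate_keep_zero(toy_df)
toy_result
```

With this variant, an NA after pivoting would mean "no attempts", while a missed-every-time shot type would show 0.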
play_by_nyk_with_shot_rate <- compute_shot_rate_per_quarter(play_by_nyk)
## `summarise()` has grouped output by 'Quarter', 'ShotType'. You can override
## using the `.groups` argument.
## `summarise()` has grouped output by 'Quarter'. You can override using the
## `.groups` argument.
play_by_nyk_with_shot_rate$team <- "NYK"
play_by_nyk_with_shot_rate
## # A tibble: 17 × 7
## # Groups: Quarter, ShotType [17]
## Quarter ShotType ShotOutcome count total_count rate team
## <int> <chr> <chr> <int> <int> <dbl> <chr>
## 1 1 2-pt dunk make 2 2 100 NYK
## 2 1 2-pt jump shot make 3 6 50 NYK
## 3 1 2-pt layup make 5 8 62.5 NYK
## 4 1 3-pt jump shot make 2 8 25 NYK
## 5 2 2-pt dunk make 4 5 80 NYK
## 6 2 2-pt jump shot make 5 10 50 NYK
## 7 2 2-pt layup make 2 3 66.7 NYK
## 8 2 3-pt jump shot make 3 5 60 NYK
## 9 3 2-pt dunk make 5 5 100 NYK
## 10 3 2-pt jump shot make 3 8 37.5 NYK
## 11 3 2-pt layup make 1 2 50 NYK
## 12 3 3-pt jump shot make 2 5 40 NYK
## 13 4 2-pt jump shot make 6 13 46.2 NYK
## 14 4 3-pt jump shot make 2 6 33.3 NYK
## 15 5 2-pt dunk make 2 2 100 NYK
## 16 5 2-pt jump shot make 1 1 100 NYK
## 17 5 3-pt jump shot make 2 4 50 NYK
play_by_atl_with_shot_rate <- compute_shot_rate_per_quarter(play_by_atl)
## `summarise()` has grouped output by 'Quarter', 'ShotType'. You can override
## using the `.groups` argument.
## `summarise()` has grouped output by 'Quarter'. You can override using the
## `.groups` argument.
play_by_atl_with_shot_rate$team <- "ATL"
play_by_atl_with_shot_rate
## # A tibble: 19 × 7
## # Groups: Quarter, ShotType [19]
## Quarter ShotType ShotOutcome count total_count rate team
## <int> <chr> <chr> <int> <int> <dbl> <chr>
## 1 1 2-pt dunk make 1 1 100 ATL
## 2 1 2-pt jump shot make 4 9 44.4 ATL
## 3 1 2-pt layup make 1 4 25 ATL
## 4 1 3-pt jump shot make 3 14 21.4 ATL
## 5 2 2-pt dunk make 2 2 100 ATL
## 6 2 2-pt jump shot make 2 9 22.2 ATL
## 7 2 2-pt layup make 4 8 50 ATL
## 8 2 3-pt jump shot make 3 6 50 ATL
## 9 3 2-pt dunk make 2 4 50 ATL
## 10 3 2-pt jump shot make 3 6 50 ATL
## 11 3 2-pt layup make 4 8 50 ATL
## 12 3 3-pt jump shot make 2 7 28.6 ATL
## 13 4 2-pt dunk make 2 2 100 ATL
## 14 4 2-pt jump shot make 3 4 75 ATL
## 15 4 2-pt layup make 3 4 75 ATL
## 16 4 3-pt jump shot make 5 9 55.6 ATL
## 17 5 2-pt jump shot make 1 2 50 ATL
## 18 5 2-pt layup make 2 2 100 ATL
## 19 5 3-pt jump shot make 1 5 20 ATL
Finally, we want to compare the two teams side by side, so we’ll combine these two tables into one.
Let’s restructure the data into a more readable format in three steps:
– Exclude three columns (ShotOutcome, count, total_count).
– Pivot the data so that each quarter becomes a column holding the success rate.
– Combine the two tables with the rbind() function.
ATL_organized <- play_by_atl_with_shot_rate %>%
subset(select = -c(ShotOutcome, count, total_count) ) %>%
pivot_wider(
names_from = Quarter,
values_from = rate
)
NYK_organized <- play_by_nyk_with_shot_rate %>%
subset(select = -c(ShotOutcome, count, total_count) ) %>%
pivot_wider(
names_from = Quarter,
values_from = rate
)
Agg_ATL_NYK_shot_rate_per_q <- rbind(ATL_organized, NYK_organized)
Agg_ATL_NYK_shot_rate_per_q
## # A tibble: 8 × 7
## # Groups: ShotType [4]
## ShotType team `1` `2` `3` `4` `5`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2-pt dunk ATL 100 100 50 100 NA
## 2 2-pt jump shot ATL 44.4 22.2 50 75 50
## 3 2-pt layup ATL 25 50 50 75 100
## 4 3-pt jump shot ATL 21.4 50 28.6 55.6 20
## 5 2-pt dunk NYK 100 80 100 NA 100
## 6 2-pt jump shot NYK 50 50 37.5 46.2 100
## 7 2-pt layup NYK 62.5 66.7 50 NA NA
## 8 3-pt jump shot NYK 25 60 40 33.3 50
Let’s view this data in table format.
# Render the data as a formatted table with borders on all seven columns
kable(Agg_ATL_NYK_shot_rate_per_q, row.names = FALSE) %>%
  column_spec(1:7, border_left = TRUE, border_right = TRUE) %>%
  kable_styling()
| ShotType | team | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 2-pt dunk | ATL | 100.00000 | 100.00000 | 50.00000 | 100.00000 | NA |
| 2-pt jump shot | ATL | 44.44444 | 22.22222 | 50.00000 | 75.00000 | 50 |
| 2-pt layup | ATL | 25.00000 | 50.00000 | 50.00000 | 75.00000 | 100 |
| 3-pt jump shot | ATL | 21.42857 | 50.00000 | 28.57143 | 55.55556 | 20 |
| 2-pt dunk | NYK | 100.00000 | 80.00000 | 100.00000 | NA | 100 |
| 2-pt jump shot | NYK | 50.00000 | 50.00000 | 37.50000 | 46.15385 | 100 |
| 2-pt layup | NYK | 62.50000 | 66.66667 | 50.00000 | NA | NA |
| 3-pt jump shot | NYK | 25.00000 | 60.00000 | 40.00000 | 33.33333 | 50 |
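The rendered table prints the rates with five decimal places, which makes them harder to scan. A minimal sketch of one way to round the display, using the digits argument of kable() from knitr (toy_rates is a hypothetical two-row excerpt of the values above, and the column name rate is made up for the example):

```r
library(knitr)

# Hypothetical excerpt: ATL first-quarter rates from the table above
toy_rates <- data.frame(
  ShotType = c("2-pt jump shot", "3-pt jump shot"),
  team     = c("ATL", "ATL"),
  rate     = c(44.44444, 21.42857)
)

# digits = 1 rounds numeric columns to one decimal place for display only;
# the underlying data frame is not changed
rounded_table <- kable(toy_rates, format = "pipe", digits = 1, row.names = FALSE)
rounded_table
```

The same digits argument could be added to the kable() call on the full table before the styling steps.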
It appears that visualizing this information through plotting might provide a clearer understanding than examining the table itself.
As you can see, the dataset “Agg_ATL_NYK_shot_rate_per_q” currently has eight rows and seven columns. However, using bare numbers as column names is error-prone: when you write 1 in code, R reads it as the number 1, not as the column named “1”. Consider the following code:
ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, 1, color = team)) + geom_point(size = 10) + scale_colour_manual(name="",
values = c("NYK"="orange", "ATL"="red"))
You might assume this plot displays the success rate for each team, but it actually plots the value 1 on the y-axis.
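If you ever need to keep a numeric column name, wrapping it in backticks tells R to treat it as a name rather than a number. A minimal sketch of the difference on a small made-up data frame (toy_wide, p_constant, and p_column are hypothetical names):

```r
library(ggplot2)

# Hypothetical two-row table with a column literally named "1"
toy_wide <- data.frame(
  ShotType = c("2-pt dunk", "2-pt layup"),
  team     = c("ATL", "NYK"),
  `1`      = c(100, 62.5),
  check.names = FALSE   # keep "1" as the column name instead of converting it to "X1"
)

# A bare 1 is the numeric constant: every point is drawn at y = 1
p_constant <- ggplot(toy_wide, aes(ShotType, 1, color = team)) + geom_point()

# Backticks refer to the column named "1": points are drawn at the stored rates
p_column <- ggplot(toy_wide, aes(ShotType, `1`, color = team)) + geom_point()

layer_data(p_constant)$y  # y values used by the first plot
layer_data(p_column)$y    # y values used by the second plot
```

Backticks work, but they are easy to forget, which is why renaming the columns is the safer choice here.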
To avoid this issue, we’ll rename the numeric column names to include “Q” for quarter. Let’s examine the current column names:
colnames(Agg_ATL_NYK_shot_rate_per_q)
## [1] "ShotType" "team" "1" "2" "3" "4" "5"
As expected, we have seven columns: “ShotType”, “team”, “1”, “2”, “3”, “4”, and “5”. We’ll modify these to “Q1”, “Q2”, “Q3”, “Q4”, and “Q5” while preserving the original order.
colnames(Agg_ATL_NYK_shot_rate_per_q) <- c("ShotType", "team", "Q1", "Q2", "Q3", "Q4", "Q5" )
It’s crucial to maintain the original column order for accurate data analysis.
Alternatively, the gsub() function with regular expressions can be used to rename columns without worrying about order.
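A minimal sketch of that gsub() approach, shown on a plain character vector that copies the column names printed above (old_names and new_names are hypothetical names used only for this illustration):

```r
# Column names as printed by colnames() above
old_names <- c("ShotType", "team", "1", "2", "3", "4", "5")

# "^([0-9]+)$" matches names made up entirely of digits;
# "Q\\1" puts a Q in front of the digits that were matched
new_names <- gsub("^([0-9]+)$", "Q\\1", old_names)
new_names

# On the real table, the same idea would be:
# colnames(Agg_ATL_NYK_shot_rate_per_q) <-
#   gsub("^([0-9]+)$", "Q\\1", colnames(Agg_ATL_NYK_shot_rate_per_q))
```

Because the pattern only touches all-digit names, ShotType and team are left alone wherever they sit in the table.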
This section will explore the shooting success of the Atlanta Hawks (ATL) and New York Knicks (NYK) throughout the game, focusing on each quarter and overtime. We’ll utilize visualizations to compare their performance across different shot types (2pt Dunk, 2pt Jump Shot, 2pt Layup, and 3pt Shot).
First Quarter
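The opening quarter favored NYK. Our plot (Figure 1) shows that both teams made all of their dunks, while NYK had the higher success rate for every other shot type: 2pt Jump Shots (50% vs. 44.4%), 2pt Layups (62.5% vs. 25%), and 3pt Shots (25% vs. 21.4%).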
ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q1, color = team)) + geom_point(size = 10) + scale_colour_manual(name="",
values = c("NYK"="orange", "ATL"="red"))+ ggtitle("Figure 1: Q1") + ylab("success rate for every shot")
Second Quarter
The second quarter showcased a more competitive battle. While ATL edged out NYK on 2pt Dunk attempts (Figure 2), NYK dominated in 2pt Jump Shots, Layups, and 3pt Shots.
ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q2, color = team)) + geom_point(size = 10) + scale_colour_manual(name="",
values = c("NYK"="orange", "ATL"="red"))+ ggtitle("Figure 2: Q2") + ylab("success rate for every shot")
Overall First Half
Based on the first two quarters (Figures 1 & 2), NYK held the advantage in shooting success during the first half: in each quarter it had the higher success rate in three of the four shot types.
Third Quarter
The third quarter saw a dip in ATL’s performance, particularly on 2pt Dunks, where they made only 50% against NYK’s 100% (Figure 3). However, they held an edge in 2pt Jump Shots (50% vs. 37.5%), and the teams were level on 2pt Layups at 50%.
ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q3, color = team)) + geom_point(size = 10) + scale_colour_manual(name="",
values = c("NYK"="orange", "ATL"="red"))+ ggtitle("Figure 3: Q3") + ylab("success rate for every shot")
Fourth Quarter and Overtime
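Since the game went into overtime, we expected ATL to outperform NYK in the fourth quarter. This was indeed the case! Figure 4 shows ATL with the higher success rate in both shot types where the two teams can be compared: 2pt Jump Shots (75% vs. 46.2%) and 3pt Shots (55.6% vs. 33.3%). NYK has no value (NA) for 2pt Dunks and 2pt Layups in this quarter, which is why the plot warns about two removed rows.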
ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q4, color = team)) + geom_point(size = 10) + scale_colour_manual(name="",
values = c("NYK"="orange", "ATL"="red")) + ggtitle("Figure 4: Q4") + ylab("success rate for every shot")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
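Overtime presented a more mixed picture (Figure 5). ATL made both of its 2pt Layups, but NYK had the higher success rate on 2pt Jump Shots (100% vs. 50%) and 3pt Shots (50% vs. 20%). Two points are missing again: ATL has no value for 2pt Dunks and NYK has none for 2pt Layups.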
ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q5, color = team)) + geom_point(size = 10) + scale_colour_manual(name="",
values = c("NYK"="orange", "ATL"="red"))+ ggtitle("Figure 5: Overtime") + ylab("success rate for every shot")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
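The five plots above repeat the same call with a different column each time. A minimal sketch of one way to draw all quarters in a single figure: reshape the table to long format with pivot_longer() and let facet_wrap() make one panel per quarter. It is shown on a small excerpt of the Q1 and Q2 values and assumes the columns have already been renamed to Q1, Q2, and so on (toy_rates_wide, toy_rates_long, and p_all are hypothetical names):

```r
library(tidyr)
library(ggplot2)

# Excerpt of the renamed table: jump shots and 3-pointers, Q1 and Q2 only
toy_rates_wide <- data.frame(
  ShotType = c("2-pt jump shot", "3-pt jump shot", "2-pt jump shot", "3-pt jump shot"),
  team     = c("ATL", "ATL", "NYK", "NYK"),
  Q1       = c(44.4, 21.4, 50, 25),
  Q2       = c(22.2, 50, 50, 60)
)

# One row per (ShotType, team, Quarter) instead of one column per quarter
toy_rates_long <- pivot_longer(
  toy_rates_wide,
  cols      = starts_with("Q"),
  names_to  = "Quarter",
  values_to = "rate"
)

# One panel per quarter, same colors as the individual figures
p_all <- ggplot(toy_rates_long, aes(ShotType, rate, color = team)) +
  geom_point(size = 4) +
  facet_wrap(~ Quarter) +
  scale_colour_manual(name = "", values = c("NYK" = "orange", "ATL" = "red")) +
  ylab("success rate (%)")
p_all
```

Putting the quarters side by side on a shared y-axis makes it easier to see how each team's rate moves from one quarter to the next.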
While these visualizations provide valuable insights, consider incorporating additional data representations like tables. These can offer a more detailed breakdown of shooting success for each team and shot type across all quarters.
By analyzing shooting success by quarter, we can identify strengths and weaknesses for each team throughout the game. This analysis can be further enhanced by incorporating additional factors like defensive strategies and player fatigue.
Kudos
Thank you!
Presenting to the Lab
When presenting these findings to your lab, begin by providing a brief introduction. Explain the purpose of your analysis, which could be something like “I wanted to investigate the shooting performance of ATL and NYK throughout the game.”
When presenting figures, take a moment to explain the axes, colors, and shapes used. For example, you could say, “The x-axis represents the shot type, the y-axis indicates shooting success rate, and the colors orange and red represent NYK and ATL, respectively.”
Interpretation and Conclusions
Think about what your figures and tables show, and tell people what you conclude from them. This may be the most important step to complete before you present to the lab members. If you don’t know, it’s okay to say “I don’t know,” but try to think it through first.